feat(rust): Add RLE to RLE_DICTIONARY encoder #15959

Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##             main   #15959      +/-   ##
==========================================
- Coverage   81.29%   80.93%   -0.36%
==========================================
  Files        1381     1385       +4
  Lines      176876   178291    +1415
  Branches     3043     3050       +7
==========================================
+ Hits       143789   144307     +518
- Misses      32604    33493     +889
- Partials      483      491       +8

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Hmm.. Do we actually compress those v1 pages then? 🤔
I don't think we should activate it for nested types then.
Thank you for the PR. Interesting improvement. I've left some comments.
    Ok(())
}

#[allow(clippy::comparison_chain)]
pub fn encode_bool<W: Write, I: Iterator<Item = bool>>(
This is the same logic as encode_u32; can we make a single generic function and dispatch this one to it?
I'm very new to Rust, and it took me a while to figure this out. I'm not sure if I implemented this correctly.
Yup, here's the relevant code: polars/crates/polars-parquet/src/parquet/write/compression.rs, lines 23–38 at 864e750.
That makes sense. I'd like to revisit this later, but I think I'll need to learn some more Rust before tackling that problem. Thank you for the review!
Ok, I think there is value in this, so I will pick up those points and get this in.
Thank you for the PR @thalassemia |
This reverts commit 6730a72.
We had to revert this because of #16109. I do think these changes were interesting, so if you or anyone else has time to find the cause, that'd be appreciated.
@ritchie46 So sorry for this and thank you @nameexhaustion for fixing it! I should've written a test with more random, realistic data. Since this feature still sounds worthwhile, I'll look into this and include more robust tests for my next PR.
The `RLE_DICTIONARY` encoder currently only supports bit-packing. After this PR, the encoder will switch between bit-packing and RLE depending on whether a run is longer than 8 (a multiple of 8 values is always bit-packed at a time). This addresses the specific case pointed out in #10680. With these changes, I get 2745 bytes for `pl_zstd.pq` and 2962 bytes for `pa_zstd.pq` after running the following code:
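The code block itself did not survive this page export. Below is a minimal sketch of such a comparison, assuming a low-cardinality column like the one in #10680; the DataFrame contents and the resulting sizes are illustrative stand-ins, not the author's exact benchmark. Only `pl_zstd.pq`, `pa_zstd.pq`, zstd compression, and `use_pyarrow=True` come from the discussion.

```python
import os

import polars as pl

# Hypothetical low-cardinality column of the kind discussed in #10680.
df = pl.DataFrame({"x": ["a", "b", "c"] * 100_000})

# Polars' native Rust writer, which uses the dictionary encoder touched by this PR.
df.write_parquet("pl_zstd.pq", compression="zstd")

# PyArrow writer, used as the reference point.
df.write_parquet("pa_zstd.pq", compression="zstd", use_pyarrow=True)

print(os.path.getsize("pl_zstd.pq"), os.path.getsize("pa_zstd.pq"))
```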
At first, the additional logic did slow down the encoder significantly. To address this, I did some profiling and made two optimizations. The encoder took `u32` keys and downcasted them to `u16` if possible; I removed this downcasting step because it seemed unnecessary (f7e803a).
After these optimizations, the `RLE_DICTIONARY` encoder performs on par with or only slightly worse than `use_pyarrow=True` in most cases (faster than the current encoding performance). However, performance can be up to 50% worse than `use_pyarrow=True` (matching the current encoding performance) with high-cardinality data that frequently switches between RLE and bit-packing:
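To make the switching behaviour concrete, here is a minimal Python sketch of the run/bit-pack decision described above. It is not the PR's Rust code, and it ignores the multiple-of-8 grouping the real hybrid-RLE format requires; it only shows why data with many short runs keeps flipping between the two modes.

```python
from itertools import groupby

def segment_keys(keys):
    """Split dictionary keys into RLE runs and bit-packed groups using the
    simplified rule from the PR description: runs longer than 8 values are
    RLE-encoded, everything else accumulates into a bit-packed group."""
    chunks = []
    bitpack = []
    for value, group in groupby(keys):
        run = list(group)
        if len(run) > 8:
            if bitpack:
                chunks.append(("bitpack", len(bitpack)))
                bitpack = []
            chunks.append(("rle", value, len(run)))
        else:
            bitpack.extend(run)
    if bitpack:
        chunks.append(("bitpack", len(bitpack)))
    return chunks

# Long runs collapse into a couple of cheap RLE chunks.
print(segment_keys([0] * 1_000 + [1] * 1_000))

# Alternating long and short runs force the encoder to flip between RLE and
# bit-packing on almost every run, which is the slow case described above.
mixed = []
for i in range(500):
    mixed.extend([i] * (12 if i % 2 == 0 else 3))
print(len(segment_keys(mixed)))
```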
Some other changes:

- Nested columns no longer use `RLE_DICTIONARY` encoding (9c73a61). For some reason, dictionary arrays do not scale well for large nested columns, and I cannot figure out what is causing this. For example, see the sketch at the end of this description.

This does not fully resolve the linked issue, because it still does not provide users any way of manually specifying encodings like in PyArrow.
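The nested-column example also did not survive the export. A minimal sketch of the kind of column being described, with made-up shapes and file name, might look like the following; it only illustrates the data shape, not the scaling problem itself.

```python
import os

import polars as pl

# Hypothetical large nested (list) column; shapes and file name are illustrative.
df = pl.DataFrame({"nested": [list(range(20))] * 200_000})

df.write_parquet("nested_zstd.pq", compression="zstd")
print(os.path.getsize("nested_zstd.pq"))
```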